bigdata_uclfandomcom-20200215-history
Similar items
'Part 1 : Slides' 'Part 2 : Project ' In this project, we try to find similar tweets in a database. We start with a .csv file including the tweet, the language, the date of submission, the localisation, etc. So, we need to extract the tweet and make a certain selection, because, we have more than 1 billion tweets. In the first part of this project, in order to reduce the number of tweets, we selected the tweets by their language and by their length. For example, in some parts of the project we choose to analyse tweets in english and with a length greater or equal than 15 characters. Indeed, it's more interesting to compare two tweets with more than 4 words. Our method to extract these messages is the following one: file=open(r"TweetsData.csv","r", encoding="utf8") destination=open(r"MessagesENSup15.csv","w", encoding="utf8") i=0 longueur=15 DateMax='2014-01-31 23:59:59' DateMin='2013-11-01 00:00:00' langue='en' #'fr' dateL=3 lang=12 k=0 Message=[] for line in file: contenu=file.readline() cut=contenu.split(',') taille=len(cut) if taille=DateMin): mess=cut1 j=2 while taille >longueur: mess='{0},{1}'.format(mess,cutj) j=j+1 taille=taille-1 if len(mess)<=15: #<=10: pass else: Message.append(mess) k=k+1 destination.write("%s\n" % (mess)) else: pass destination.close() file.close() Techniques: Similarity: A notion important to know before continuing is, the Jaccard similarity between two sets. It can be defined as: SIM(s,t)=\frac{(s \cap t)}{(s \cup t)} It's the ratio between the number of elements in common for the two sets and the number of elements in their union. The function used to obtain the jaccard similarity is the following def jaccard(X, Y): x = set(X) y = set(Y) return float(len(x & y)) / len(x | y) A first method to compare the tweets, is to transform the strings into set, and then, apply the equation of the jaccard similarity. You can observe some examples, in french, of pairs that have a similarity of 1 Examples: * 1st pair: tweet11="Je veux pas sortir de mon liiiiiit" tweet21=" Je veux pas sortir de mon lit" * 2nd pair: tweet12= "J'ai envie d'enfermer moumoune" tweet22= "J'ai envie de fumer de fouuuuu" * 3rd pair: tweet13="Je veux pas sortir de mon liiiiiiit" tweet23="Je les supportes vraiment pas les deux là" The first pair is really similar but not exactly the same, on the contrary of the second pair, which give a perfect score for two tweets totally different. In fact, this method compare all the letters in each string. So, if two strings have the same characters, then, there are said to be equivalent. Indeed, when we transform a string into a set, this last one contains once all the letters of the string, and in a random position. So, this method is not really helful in the search of real similar pairs. K-shingles: To avoid the problem we encountered just before, we can convert all the strings into k-shingles. Let's remember that a k-shingle is a string of k characters. Since the size of a tweet is quite small, we choose k=4 (in this project we have chosen to consider the spaces between words, but you could remove them). Let's take a example: Example: s="hello how are you" becomes : S='ello', 'llo ','lo h','o ho', ' how', 'how ', 'ow a', 'w ar', ' are', 'are ', 're y', 'e yo', ' you' You can see below our implementation to create the k-shingles: def kshingle(s,k): """ function which computes the k-shingles """ splitlist=list(s) kstringinter=[] for i in range(len(s)-k+1): liste=[] for j in range(i,i+k): liste.extend(splitlistj) mot=''.join(liste) kstringinter.append(mot) kstring=list(set(kstringinter)) return kstring We can compute the jaccard similarity on this result by calling kshingles(S,k). Now, we look after the k-shingles which are common to the two tweets. Example: tweet1: "je mange des papates, coucou le loup" tweet2: "Je mange des patates, coucou la poule" So, similarity= 0.7632 Applications of these thechniques: There are different ways to compare some messages, we can for example apply a many-one or a many-many algorithm. The many-one problem consist of taking one tweet and search in all the database, which tweets among it, are the most similar with it. The many-many problem consist of gathering all the tweets which have a high similarity. Many-one In the first case, we have a tweet and we want to find tweets that resemble our. This occurs, for example, when we want to know if there are people who write the same thing as us. In this case, we have no choice but to compare the tweet T to all others by the k-shingles technique. Example: In this part we selected only the tweets in english and with a length bigger or equal to 10 characters. Take the tweet T: "justin bieber i love you" Using the technique of k-shingles, we compare this tweet to the others with the Jaccard similarity. Here is an histogram of similarity of the other tweets with T. We didn't represent those with Jaccard similarity equal to zero. We clearly see that the most of tweets are textually different than T. In fact we found: * 1046372 tweets with a similarity between 0% and 1% * 15033 with a similarity between 10% and 20% * 1733 with a similarity between 20% and 30% * 438 with a similarity between 30% and 40% * 101 with a similarity between 40% and 30% * 56 with a similarity between 50% and 60% * 0 with a similarity greater than 60% Here some of them: * "little happy meal": 0% * "@Louis_Tomlinson I fucking love you to ??: 11% * "@_MIKEw_ i love you!": 24.2% * "@justinbieber follow me i love you": 38.5% * "@justinbieber love you xx": 40% * "@justinbieber love you...": 56% We can observe that even a Jaccard similarity of 55% is significant here. This method is very simple and the time of execution is correct, comparing one tweet with other million tweets had take about 30 seconds, We emphasize that in this project, we study the textual similarity between two tweets and not the content. Intermediate case In this section, we will be interested by an intermediate method, we take n tweets arbitrarily (the reference) and then, we compare them with the others. The random selection is described just below. We will then analyse the many-one and the many-many problem in the next part. def classementmessage(): file=open(r"MessagesENSup15.csv","r", encoding="utf8") import random n=264685; # number of tweets selected m=100; # number of reference messages listeBase=[] listeCompl=[] randomIdxList = random.sample(set(range(n)), m) k = 0; for line in file: mess = file.readline(); if k in randomIdxList: listeBase.append(mess); else: listeCompl.append(mess); k = k+1; file.close() return (listeBase,listeCompl) The histogram show the number of reference tweets for which the maximum similarity we found with the others from the database is a certain value. Let's note that we take randomly 100 messages for the reference and 264 585 others messages, all in english and with more than 15 characters. We can observe that a big part of the values are between 0.1 and 0.3, the similarity is not obvious. Here is somme examples of pairs for a certain rounded similarity, obtained via our algorithm: sim=0.2: I'm at Mom's Place http://t.co/ilvZIMamXN "I'm at Lander's ""cinema"" http://t.co/KlgZaTHXUn" sim=0.4: There's no way back There's no going back sim=0.5: I'm at Restaurace Nemesis (Praha) http://t.co/7mAP8WRo57 I'm at Restaurace Bella (Praha) http://t.co/IigUejWgCE or I give my all to you Ooooh ???????? ! I give my all to you ???? sim=0.8: @5SOS FOLLOW ME PLEASE ???? 35 @5SOS FOLLOW ME PLEASE ???? 19 sim=0.9: If I didn't care, I wouldn't have stuck around this long..... "If I didn't care, I wouldn't have stuck around this long." sim=1: #votearianagrande We can see that the tweets are really different and when the similarity is perfect, it's for the hashtags. With the same characteristics, we can make an other histogram including all the values of Jaccard similarity observed, so, not only the maximum value anymore. We can observe that the tweets are really not similar, the indices are about 0.1. So, we can really not talk about plagiarism between tweets! They are very different. The following histogram is realized by the same way, by only representing the maximum value of the Jaccard similarity for each tweets of reference. We take randomly 1000 messages for the reference and 577 180 other messages, all in french with more than 15 characters and from the 1st November 2013 to the 31th. January 2014. We don't take more tweets, because the time of computation is huge enough, about 3 hours for the last case. Many-many In this situation, we have a set of tweets and we want to find pairs of similar tweets. In this case, the number of pairs is too high for us to evaluate the similarity of each pair. So, we are going to implement other techniques. Other techniques In order to decrease the space needed to store a set, we could hash sets of shingles to four bytes each. The set representing a tweet is then the set of integers that are bucket numbers of one or more k-shingles that appear in the tweet. But even in this case, the space needed is still roughly four times the space taken by the tweet. Our goal is to find some pair which have a similarity above a threshold which we can choose. The LSH method, LSH for Minhash Signature In this project, we tried to implement the LSH technique for minhash signatures. This method, determines which candidates among all possible pairs to have great similarity. In this way, the pairs who have no chance to be similar are not taken into account, so the number of pairs to treat becomes feasible. Here, any pair that hashed to the same bucket for any of the hashings are "a candidate pair". When we use minhashing signature, an effective way to choose this selection is to divide the signature matrix into b'' bands consistings of ''r ''rows each. For each band, there is a hash function that takes vectors of ''r integers and hashes them to some large number of buckets. This method consist on the following steps: # Pick a value of k ''and construct from each tweet the set of k-shingles # Pick a length ''n for the minhash signatures. Compute the minhash signatures for all the tweets (minhash method is explained later) # Choose a threshold t that defines how similar documents have to be for being a candidate pair. # Construct candidate pairs by the technique of bands. # Examine each pair. Minhashing The "minhashing" compress tweets into small signatures and preserve the expected similarity of any pair of tweet. We can visualise a collection of sets as their "characteristic matrix". In this matrix, the columns correspond to the sets and the rows correspond to elements of the universal set. Here the code to calculate it: ********************************* Code Characteristic Matrix ''*********************************''' #function that returns the universal set of k-shingles def totShingles(tweets,k): totShingle = [] for i in xrange(len(tweets)): tweet = tweetsi shingDoc = shingle(tweet,k) # gives the universal set of k-shingles for i in xrange(len(shingDoc)): if shingDoci not in totShingle: totShingle.append(shingDoci) return totShingle #construction of characteristic matrix def caractMatrix(tweets, k): shingles=totShingles(tweets,k) matrix=sparse.csr_matrix((len(shingles), len(tweets))) for i in xrange(len(shingles)): for j in xrange(len(tweets)): if shinglesi in tweetsj: matrixi,j=1 return matrix For example, if we have the two tweets as "hey" and "hi", then taking k=2 we obtain the following characteristic matrix: [[ 1. 0.] [ 1. 0.] [ 0. 1.]] The similarity between sets is conserved with the minhash signature. This is because of an important property: "The probability that the minhash function for a random permutation of rows produces the same value for two sets equals the Jaccard similarity of those sets". To show this property, we tested with the following signature matrix (on the left of the page), using 2 hash functions (h1(x)=(3x+1)mod5, h2(x)=(x+1)mod5, and we obtained the following results (matrix on the right of the page): [1 0 1 [1. 0. 1. 0 1 0 [ 0. 1. 0. 0.]] 0 0 1 0 1 1 0 1 0] We can see that the first and the fourth columns are almost equals, that similarity is almost the same for the first and the last columns of the minHash signature. The second and the third columns aren't very similar, this in the same in the second and third columns of the minhash signature. Taking more hash functions, the property should be respected. Although, when we try with some random tweets, the property is not respected. This is probably due to an error in the implementation but we don't understand where. For example, using 100 hash functions for 2 tweets that aren't similar (similarity 0.1) we find that the minhash signatures had a similarity of 0.4 which is greater than the real similarity. Tweet1: @EFA_Patients the 1st mention of the #EUCOPDBrx13 hashtag appears on your TL. Now is Trending Topic in Belgium! #trndnl Tweet2: @VibeSurfShop @shedseven see you there mate ? ********************************* Code MinHash Signature '*********************************' def minHashSignature(tweets,k,n): matrixC=caractMatrix(tweets,k) #construction of characteristic matrix matrixSig=np.zeros((n,len(tweets)))+float('inf') nbrRows=matrixC.shape0 hashMatrix=np.zeros((nbrRows,n)) # simulation of permutation of rows for j in xrange(n): a=rnd.randint(0,nbrRows) #comme dans le book for i in xrange(nbrRows): hashMatrixi,j=(a*i+1)%nbrRows for j in xrange(len(tweets)): for z in xrange(nbrRows): if matrixCz,j 1: for i in xrange(n): matrixSigi,j= min(matrixSigi,j,hashMatrixz,i) return matrixSig LSH for minhash signatures ********************************* Code LSH for Minhash Signatures********************************* bands=50 rows=35 n=bands*rows t=0.70 def minHashLSH(threshold,matrixSig,bands,rows): lenghtBuckets=10000 arrayBuckets = [] for band in xrange(bands): arrayBuckets.append([[] for i in xrange(lenghtBuckets)]) candidatePairs = {} # clefs: paire de tweets, valeurs: leur vraisemblance i = 0 for b in xrange(bands): buckets = arrayBucketsb band = matrixSigi:i+rows,: for col in xrange(band.shape1): x= int(sum(band:,col)%len(buckets)) bucketsx.append(col) i = i+rows for item in buckets: if len(item) 2: pair = (item0, item1) if pair not in candidatePairs: A = matrixSig[:,item0] B = matrixSig[:,item1] similarity=cosineSim(A,B) if similarity >= threshold: candidatePairspair = similarity if len(item)>2: for perm in combinations(item,2): pair=(perm0,perm1) if pair not in candidatePairs: A = matrixSig[:,perm0] B = matrixSig[:,perm1] similarity= cosineSim(A,B) if similarity >= threshold and jacSim(tweets[perm0],tweets[perm1],k)>=0.5: candidatePairspair = similarity results= sorted(candidatePairs.items(),key=itemgetter(1), reverse=True) return results, candidatePairs m=minHashSignature(tweets,k,n) Take for example those tweets in french: tw1= "bonjour tout le monde" tw2= "bonjour tous" tw3= "quelle belle journée" tw4= '''"c est une jolie fille" Applying the code of LSH, choosing 50 bands of 35 rows, picking k=4 and a threshold of 70%, we found in 0.7 second the pair (tw1,tw2) as a candidate with a similarity of 95%,. This pair is the most similar one but its true similarity is 40%! This difference is probably due to the error in the implementation of the minhash signature that we don't understand. We tested this code with 1000 tweets in english, but in order to counteract the mistake of the minhash signature we add a condition for being a candidate pair: if similarity >= threshold and jacSim(tweets[item0],tweets[item1],k)>=0.5: With this condition, we found those candidate pairs with a true similarity as follows: "@teenageeeer i dunno go play hailie baby yo dad's busy" and "@teenageeeer i dunno go play hailie baby yo dad's busy" with a similarity 100%. @onedirection Beach Boy - God Only Knows #MidnightMemoriesSpotifyQuizQ7" and "@onedirection Beach Boy- God Only Knows #MidnightMemoriesSpotifyQuizQ7 xx" with a similarity 94.5% Because of the problem of the implementation of the minhash signature, we know that the result isn't very satisfactory. These aren't all the most similar pairs. For example, our code didn't find those tweets: "@Sarah_TOF the 1st mention of the #andben hashtag appears on your TL. Now is Trending Topic in Belgium! #trndnl" "@brien_x the 1st mention of the @zwartepeter hashtag appears on your TL. Now is Trending Topic in Belgium! #trndnl" Which are very similar. '''Difference between the textual similarity and the content The main goal of this project was to find textual similar tweets. This is different with finding tweets with the same content even if a pair of tweets with a high degree of similarity could have similar content. In order to compare those two things, we consider a coefficient TF-IDF (Term Frequency times Inverse Document Frequency). ''This coefficient allows to represent in some way how a word is representative of a tweet. This coefficient is computed as follows. Suppose we have N tweets, define f_{i,j} to be the number of occurrences of word ''i in tweet j''. The ''term frequency '' TF_{i,j} is: TF_{i,j}=\frac{f_{i,j}}{max_k f_{k,j}} that is, the numerator is the term frequency of term ''i in tweet j'' , normalized by its maximum number of occurrences of any term in the same document. Suppose term ''i appears in n_i of the N tweets in the collection. Then IDF_i=log_2(\frac{N}{n_i}) . The TF.IDF score for the term i'' in tweet ''j is defined by TF_{i,j} x IDF_i . The terms with the highest score are often the most representative of the topic of a tweet. In order to calculate this score, we decided to eliminate every "?" "!" in tweets and to consider only words with a length greater than 2 characters. ********************************* Code for the TFIDF********************************* #dictionary for every tweet def dictionnaire(tweets): dic = {} for tweet in tweets: mots=tweet.split(' ') for mot in mots: if len(mot)>3: if mot not in dic: dicmot = 1 else: dicmot += 1 return dic #dictionary for all the tweets def miniDict(tweet): dic={} paroles=tweet.split(' ') for mot in paroles: if len(mot)>1: # >2 if mot not in dic: dicmot=1 else: dicmot+=1 return dic def calculmaxTweet(tweet): minid=miniDict(tweet) maxi=max(minid.values()) return maxi mat=np.zeros((len(dicTot),2)) #computing the TF.IDF score for i in xrange(len(dicTot)): z=motsi #mot clefs tw={} lis1=np.zeros(len(dicTot)) for j in xrange(len(tweets)): valMax=0 importants=j tweet=tweetsj minid=miniDict(tweet) maxi=float(calculmaxTweet(tweet)) if z in minid: num=float(minidz) val=float(num/maxi)*np.log2(N/ni) TFIDFi,j=val else: TFIDFi,j=float(0) In order to calculate this coefficient, we have to start by create a dictionary for every tweet. Then we construct a dictionary for all the tweets. Here an example: Tw1: "ciao bella" Tw2: "ciao mamma" We find the following TF.IDF matrix: With those elements we can compute the TF.IDF score. [0 1.58 0] Hence, the key word of tw1 is "bella" and the key word of tw2 in "mamma". We calculated the TF.IDF for 40 tweets in french, for every row we calculate the similarity of the pairs with the same TF.IDF score. Wi find the graph in the left. We can see that the textual similarity is not correlated with the similarity of the content of tweets. This is because we can have a pair of tweets that talk about the same thing but totally differently, or a pair that talk in the same way but about different things. If we want to find pairs of tweets that talk about the same topic, we have to use other techniques. Conclusion In conclusion, when we work with big data, it is necessary to implement some techniques which enable to make a selection of the useful data. Indeed, the time of computation of the similarities is huge enough. Moreover the results are not very convincing, there is still a difference between the obtained values and the real ones. We finally, observe that the textual similarity of tweets doesn't necessarily imply the similarity of content. Reference Leskovec, J., rajaraman, A., Ullman, J.. Mining of Massive Datasets.